Predicting Unseen Triphones with Senones - Speech and Audio Processing, IEEE Transactions on

Author

  • Mei-Yuh Hwang
Abstract

In large-vocabulary speech recognition, we often encounter triphones that are not covered in the training data. These unseen triphones are usually backed off to their corresponding diphones or context-independent phones, which contain less context yet have plenty of training examples. In this paper, we propose to use decision-tree-based senones to generate needed senonic baseforms for these unseen triphones. A decision tree is built for each Markov state of each base phone; the leaves of the trees constitute the senone pool. To find the senone associated with a Markov state of any triphone, the corresponding tree is traversed until a leaf node is reached. The effectiveness of the proposed approach was demonstrated in the ARPA 5000-word speaker-independent Wall Street Journal dictation task. The word error rate was reduced by 11% when unseen triphones were modeled by the decision-tree-based senones instead of context-independent phones. When there were more than five unseen triphones in each test utterance, the error rate reduction was more than 20%.

The shared-distribution model (SDM) [9] in the SPHINX-II system [8] is an effective method to make full use of a limited amount of training data. It clusters Markov states instead of entire phonetic hidden Markov models (HMM's), leading to a set of clustered output distributions called senones [15]. Senones significantly improve recognition accuracy and provide a pronunciation-optimization capability.

The agglomerative SDM approach in [9] has complete freedom to form a shared configuration across different Markov states based on the training data. Because it is purely data driven, it is difficult to model a triphone that never occurs in the training data. These unseen triphones are usually backed off to the corresponding diphones or context-independent phones, which contain less context yet have plenty of training examples. In dictating large vocabulary tasks or switching to new tasks, we often run into new triphones. If we simply back off to context-independent phones or diphones, the quality of these less detailed models is often not good enough, leading to increased search time and search errors.

Decision trees have been used to model allophones as a top-down generalization approach [3], [2], [6] that can be used to model unseen triphones. It replaces an unseen triphone with an existing triphone that has sufficient context similarity. To incorporate this prediction capability, we propose to use decision trees for Markov state modeling. The individual output distributions, not the entire phonetic HMM's, are classified by decision trees with linguistic binary questions on the tree nodes. We build one decision tree for each Markov state of each base phone. In other words, we impose two constraints, for the sake of simplicity, while constructing our decision trees: phone dependency, which prohibits HMM output distributions of different phones from being clustered, and state dependency, which allows HMM output distributions to be merged only if they are associated with the same kth Markov state in the model topology. However, to relate the trees for different Markov states of the same base phone, global information about the entire phonetic model is utilized, which will be elaborated in Section II-B.

To determine the associated senone for a given triphone state, the corresponding tree is traversed until a leaf is reached. The traversal is guided by answering the linguistic question associated with each nonleaf node, which has a yes-child and a no-child corresponding to the question. The senonic decision tree inherits the merits of the agglomerative SDM. More importantly, it also provides the ability to model unseen triphones.

In the ARPA 5000-word speaker-independent Wall Street Journal (WSJ) dictation experiments, we found that if unseen triphones were always represented by context-independent phone models, the decision-tree-based senone performed, as expected, slightly worse than the agglomerative SDM since the latter had much more freedom in optimizing the clustering of seen triphones. However, when the unseen triphones were modeled by the senonic decision tree, the tree-based approach was able to outperform the agglomerative SDM, which had no elegant method of modeling unseen triphones. We observed an 11% error rate reduction when the senonic decision tree predicted unseen triphones, which underlines the importance of accurately modeling unseen triphones. When there were at least five unseen triphones for each utterance in the test set, the error rate was reduced by more than 20%. Even when the test set contained few or no unseen triphones, modeling unseen triphones accurately could help the decoder prune those wrong paths containing unseen triphones. The proposed … with similar improvements [19], [4], [25]. … an improved version of the bottom-up senonic decision tree. Finally, in Section IV, we present …
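
The tree lookup described in the text above can be illustrated with a small sketch. This is not the paper's implementation: the phone classes, the two linguistic questions, the tree shapes, and the senone IDs below are all invented for illustration, and real trees would be grown from training data. The sketch only shows the mechanics the text describes: one tree per Markov state of each base phone, a yes-child and a no-child at every nonleaf node, and a traversal that ends at a leaf (a senone) even for a triphone that was never seen in training.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union

# Hypothetical phone classes used to phrase the linguistic questions.
NASALS = {"m", "n", "ng"}
VOWELS = {"aa", "ae", "ah", "iy", "uw"}


@dataclass(frozen=True)
class Triphone:
    left: str   # left context phone
    base: str   # base phone
    right: str  # right context phone


@dataclass
class Leaf:
    senone_id: int  # a leaf of the tree is one senone


@dataclass
class Node:
    question: Callable[[Triphone], bool]  # binary linguistic question
    yes: Union["Node", Leaf]              # child taken when the answer is yes
    no: Union["Node", Leaf]               # child taken when the answer is no


Tree = Union[Node, Leaf]


def left_is_nasal(t: Triphone) -> bool:
    return t.left in NASALS


def right_is_vowel(t: Triphone) -> bool:
    return t.right in VOWELS


def find_senone(tree: Tree, tri: Triphone) -> int:
    """Walk from the root to a leaf by answering each node's question."""
    node = tree
    while isinstance(node, Node):
        node = node.yes if node.question(tri) else node.no
    return node.senone_id


# One tree per (base phone, Markov state index). Toy trees, assumed shapes.
TREES: Dict[Tuple[str, int], Tree] = {
    ("k", 0): Node(left_is_nasal, Leaf(0), Node(right_is_vowel, Leaf(1), Leaf(2))),
    ("k", 1): Node(right_is_vowel, Leaf(3), Leaf(4)),
}


def senonic_baseform(tri: Triphone, num_states: int) -> List[int]:
    """Return one senone ID per Markov state of the given triphone."""
    return [find_senone(TREES[(tri.base, k)], tri) for k in range(num_states)]


# A triphone assumed to be absent from the training data still gets senones.
unseen = Triphone(left="ng", base="k", right="iy")
print(senonic_baseform(unseen, num_states=2))  # [0, 3]
```

Because the questions depend only on the phonetic context and not on whether the triphone occurred in training, the same lookup serves seen and unseen triphones alike, which is the property the text contrasts with backing off to context-independent phones.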

Similar resources

Predicting unseen triphones with senones

In large-vocabulary speech recognition, the decoder often encounters triphones that are not covered in the training data. These unseen triphones are usually represented by corresponding diphones or context-independent monophones. We propose to use decision-tree-based senones to generate needed senonic baseforms for unseen triphones. A decision tree is built for each individual Markov state of e...

Large Scale Distributed Acoustic Modeling With Back-Off N-Grams

The paper revives an older approach to acoustic modeling that borrows from n-gram language modeling in an attempt to scale up both the amount of training data and model size (as measured by the number of parameters in the model), to approximately 100 times larger than current sizes used in automatic speech recognition. In such a data-rich setting, we can expand the phonetic context significantl...

Creation of unseen triphones from seen triphones, diphones and phones

With limited training data, infrequent triphone models for speech recognition will not be observed in sufficient number. In this report, a speech production approach is used to predict the characteristics of unseen triphones by using a transformation technique in the parametric representation of a formant speech synthesiser. Two techniques are currently tested. In one approach, unseen triphones ...

Creation of unseen triphones from diphones and monophones using a speech production approach

With limited training data, infrequent triphone models for speech recognition will not be observed in sufficient number. In this report, a speech production approach is used to predict the characteristics of unseen triphones by concatenating diphones and/or monophones in the parametric representation of a formant speech synthesiser. The parameter trajectories are estimated by interpolation betw...

Publication date: 2009